The analysis of the Airbnb dataset has revealed valuable insights into various aspects of listings in New York City. The majority of listings fall into the "Entire home/apt" and "Private room" categories, indicating a preference for privacy and independence. Manhattan and Brooklyn are the most popular neighbourhood groups, with higher prices and longer minimum stay requirements than other areas. The analysis also highlights the importance of pricing and availability, which vary by room type and neighbourhood. Geographical analysis identified specific regions with higher prices and longer minimum stay requirements, providing guidance for selecting suitable accommodation.
Based on these insights, several solutions have been proposed to address the business problems. These include refining pricing strategies, promoting underrepresented neighbourhoods, enhancing search filters and the user interface, and improving the guest experience. The findings also emphasize the significance of data-driven decision making and collaboration with local authorities to optimize operations and comply with regulations. By leveraging these insights, Airbnb can enhance the user experience, attract more hosts and guests, and improve overall business performance in the competitive short-term rental market in New York City.
Performing Exploratory Data Analysis (EDA) on the Airbnb dataset poses several challenges due to its size: the NYC 2019 file used here contains 48,895 listings across 16 columns. Effective strategies are required to process and analyze this dataset and extract valuable insights and patterns. The goal is to uncover trends, patterns and key features of the data to enhance decision-making and improve the overall Airbnb experience for hosts and users.
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msn
import folium
import warnings
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Load Dataset
df = pd.read_csv('/content/Airbnb NYC 2019.csv')
# Dataset First Look
df.head().style.background_gradient(cmap='cool')
# Dataset Rows & Columns count
num_rows = df.shape[0]
num_columns = df.shape[1]
print("Dataset Rows count:", num_rows)
print("Dataset Columns count:", num_columns)
Dataset Rows count: 48895
Dataset Columns count: 16
# Dataset Info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              48895 non-null  int64
 1   name                            48879 non-null  object
 2   host_id                         48895 non-null  int64
 3   host_name                       48874 non-null  object
 4   neighbourhood_group             48895 non-null  object
 5   neighbourhood                   48895 non-null  object
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object
 9   price                           48895 non-null  int64
 10  minimum_nights                  48895 non-null  int64
 11  number_of_reviews               48895 non-null  int64
 12  last_review                     38843 non-null  object
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64
 15  availability_365                48895 non-null  int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
# Dataset Duplicate Value Count
d = df.duplicated().sum()
print(f'Dataset Duplicate Value Count is {d}')
Dataset Duplicate Value Count is 0
# Missing Values/Null Values Count
df.isnull().sum().sort_values(ascending = False)
last_review                       10052
reviews_per_month                 10052
host_name                            21
name                                 16
id                                    0
host_id                               0
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
calculated_host_listings_count        0
availability_365                      0
dtype: int64
# Visualizing the missing values
msn.bar(df,color='cyan')
Duplicate Values: The dataset does not contain any duplicate rows, so each row represents a unique record.
Missing Values: Four columns have missing values: 'name' (16), 'host_name' (21), 'last_review' (10,052) and 'reviews_per_month' (10,052). The two review columns are missing for the same number of rows, which suggests these are listings that have never been reviewed. These missing values may need to be handled appropriately depending on the specific analysis or use case.
Data Types: The dataset contains a mix of data types: integer, float and object (string). The data types indicate the nature of each variable and help determine appropriate statistical or analytical techniques for further analysis.
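One way to handle these gaps without discarding rows is to fill each column according to what a missing value most plausibly means. The sketch below is a minimal illustration on a small hypothetical sample, not the approach taken later in this notebook; the `fill_missing` helper and the "Unknown" placeholder are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample that mimics the missing-value pattern above
sample = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Quiet studio"],
    "host_name": ["Ana", "Ben", None, "Dee"],
    "number_of_reviews": [12, 0, 3, 0],
    "last_review": ["2019-05-01", None, "2018-11-20", None],
    "reviews_per_month": [0.8, np.nan, 0.2, np.nan],
})


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing values filled instead of dropped (hypothetical helper)."""
    out = frame.copy()
    # Assumption: a listing with no recorded review rate has a rate of zero
    out["reviews_per_month"] = out["reviews_per_month"].fillna(0.0)
    # Parse dates; listings that were never reviewed stay as NaT
    out["last_review"] = pd.to_datetime(out["last_review"])
    # Placeholder text for the few missing listing and host names
    out["name"] = out["name"].fillna("Unknown")
    out["host_name"] = out["host_name"].fillna("Unknown")
    return out


cleaned = fill_missing(sample)
print(cleaned.isnull().sum())
```

Filling rather than dropping keeps every listing in the sample, at the cost of encoding an assumption (zero review rate) that should be stated wherever the filled column is used.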
# Dataset Columns
df.columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
'minimum_nights', 'number_of_reviews', 'last_review',
'reviews_per_month', 'calculated_host_listings_count',
'availability_365'],
dtype='object')
# Dataset Describe
df.describe()
id: listing ID (int64)
name: name of the listing (object)
host_id: host ID (int64)
host_name: name of the host (object)
neighbourhood_group: area (object)
neighbourhood: neighbourhood within the area (object)
latitude: latitude coordinate (float64)
longitude: longitude coordinate (float64)
room_type: listing space type (object)
price: price in dollars (int64)
minimum_nights: minimum number of nights per booking (int64)
number_of_reviews: number of reviews (int64)
last_review: date of the latest review (object)
reviews_per_month: number of reviews per month (float64)
calculated_host_listings_count: number of listings per host (int64)
availability_365: number of days in the year when the listing is available (int64)
# Check Unique Values for each variable.
def unique_values(x):
    return df[x].unique()

for i in df:
    # Skip columns with too many distinct values to be worth printing
    if i in ('neighbourhood', 'reviews_per_month', 'price'):
        continue
    else:
        print('-'*50)
        print()
        print('unique values of',i)
        print(unique_values(i))
        print('-'*50)
        print('-'*50)
-------------------------------------------------- unique values of availability_365 [365 355 194 0 129 220 188 6 39 314 333 46 321 12 21 249 347 364 304 233 85 75 311 67 255 284 359 269 340 22 96 345 273 309 95 215 265 192 251 302 140 234 257 30 301 294 320 154 263 180 231 297 292 191 72 362 336 116 88 224 322 324 132 295 238 209 328 38 7 272 26 288 317 207 185 158 9 198 219 342 312 243 152 137 222 346 208 279 250 164 298 260 107 199 299 20 318 216 245 189 307 310 213 278 16 178 275 163 34 280 1 170 214 248 262 339 10 290 230 53 126 3 37 353 177 246 225 18 343 326 162 240 363 247 323 125 91 286 60 58 351 201 232 258 341 244 329 253 348 2 56 68 360 76 15 226 349 11 316 281 287 14 86 261 331 51 254 103 42 325 35 203 5 276 102 71 78 8 182 79 49 156 200 106 135 81 142 179 52 237 204 181 296 335 282 274 98 157 174 223 361 283 315 36 271 139 193 136 277 221 264 236 89 23 218 235 119 350 161 259 27 167 358 59 337 43 25 127 303 115 268 44 65 252 64 111 90 338 31 241 285 183 84 166 28 83 305 356 308 229 210 153 332 120 313 69 293 4 300 40 117 206 144 354 41 270 306 33 50 80 97 118 134 17 289 121 205 74 62 29 109 168 146 242 352 155 291 266 101 190 327 217 171 110 87 202 70 147 169 212 122 330 54 196 57 73 149 239 63 195 47 319 19 112 344 77 160 141 13 24 150 128 176 357 211 172 256 165 32 105 267 148 93 45 175 159 48 100 184 114 133 186 334 94 151 228 113 55 66 173 104 197 99 131 143 124 130 187 145 108 123 92 61 138 227 82] -------------------------------------------------- --------------------------------------------------
# Write your code to make your dataset analysis ready.
df.dropna(inplace = True) # drop every row that contains a NaN/null/missing value
# Chart - 1 visualization code
# Distribution chart of dataset
# setting figure size
figure = plt.figure(figsize=(15,10))
# setting the axis
ax = figure.gca()
# creating the chart
df.hist(ax = ax, color = 'green')
# setting title and parameters
plt.title('Histogram of df features', fontsize = 15)
plt.xlabel('Value', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
# display the figure
plt.show()
# Chart - 2 visualization code
# Distribution by room type
# Setting chart size
plt.figure(figsize = (10,6))
# setting data for chart creation
room_type_counts = df['room_type'].value_counts()
# Setting additional parameter
explode = (0.05, 0., 0.)
# Creating the chart
plt.pie(room_type_counts, labels = room_type_counts.index, autopct='%1.1f%%', explode = explode)
plt.axis('equal')
# Setting the title
plt.title('Room Types')
# Display the chart
plt.show()
Dominant Categories: The majority of listings fall into two main categories, 'Entire home/apt' and 'Private room'. 'Entire home/apt' has the highest share with 52% of listings, followed by 'Private room' with 47.5%.
Limited Shared Rooms: 'Shared room' listings are relatively rare, at only 2.4% of listings. This indicates that shared accommodation is less common in the dataset.
Preference for Privacy: The higher counts of 'Entire home/apt' and 'Private room' suggest that guests tend to prefer accommodation that offers more privacy and independence.
These insights summarize the distribution of room types and highlight the preference for private, independent accommodation in the dataset.
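The shares quoted for a pie chart can be computed directly rather than read off the figure. The following is a minimal sketch on a hypothetical ten-listing sample; in the notebook, `df['room_type']` would take the place of the `room_types` series.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df['room_type'] instead
room_types = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1,
    name="room_type",
)

# normalize=True returns fractions of the total, so multiply by 100 for percentages
shares = (room_types.value_counts(normalize=True) * 100).round(1)
print(shares)

# Sanity check: the shares of all categories should add up to roughly 100
print("Total:", shares.sum())
```

Checking that the shares sum to about 100 is a quick way to catch transcription slips when percentages are copied into the write-up.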
# Chart - 3 visualization code
# Countplot of Neighbourhood group
# Set the figure size
plt.figure(figsize= (15,3))
# Creating chart
ax = sns.countplot(data = df, x = 'neighbourhood_group')
# Setting labels
ax.set_xlabel('Neighbourhood Group')
ax.set_ylabel('Count')
# Displaying the chart
plt.show()
Manhattan and Brooklyn are the most represented neighbourhood groups, with 21,652 and 20,098 listings, respectively.
Queens, Staten Island and the Bronx have fewer listings, with 5,666, 373 and 1,090 listings, respectively.
Manhattan and Brooklyn are popular choices for Airbnb listings, potentially due to their attractions and demand for short-term rentals.
These insights highlight the dominance of Manhattan and Brooklyn in the dataset and the relatively lower representation of Queens, the Bronx and Staten Island.
# Chart - 4 visualization code
# Histplot of Reviews per month
# Set figure size
plt.figure(figsize= (10,4))
# Creating the chart
sns.histplot(data=df, x = 'reviews_per_month',bins = 60)
plt.xlabel('reviews_per_month')
# histplot shows counts per bin by default, not densities
plt.ylabel('Count')
plt.title('Reviews Per Month')
# Displaying the Chart
plt.show()
Most listings receive about one review per month or fewer, and roughly 90 percent of listings fall below 5 reviews per month.
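Statements like these can be backed with numbers rather than read from the histogram. The sketch below uses a hypothetical series of ten review rates; in the notebook, `df['reviews_per_month']` would be used instead.

```python
import pandas as pd

# Hypothetical review rates; in the notebook, use df['reviews_per_month']
reviews_per_month = pd.Series([0.1, 0.3, 0.5, 0.8, 1.0, 1.2, 2.0, 3.5, 4.0, 7.5])

# Typical value: half of the listings fall at or below the median
median_rate = reviews_per_month.median()

# Share of listings below 5 reviews per month, as a percentage
share_below_5 = (reviews_per_month < 5).mean() * 100

# 90th percentile: the rate that 90% of listings do not exceed
p90 = reviews_per_month.quantile(0.90)

print(f"median: {median_rate:.2f}")
print(f"share below 5 per month: {share_below_5:.1f}%")
print(f"90th percentile: {p90:.2f}")
```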
# Chart - 5 visualization code
# Violin plot for distribution of price
# Setting chart size
plt.figure(figsize= (10,6))
# creating violin plot
sns.violinplot(data = df, x = 'price', color ='deepskyblue')
# Setting labels and other parameters
plt.xlabel('Price')
plt.ylabel('Density')
plt.title('Price Distribution')
# Display the plot
plt.show()
# mean, mode and median of price
df['price'].mean(),df['price'].mode(),df['price'].median()
(142.33252621004095, 0 150 Name: price, dtype: int64, 101.0)
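A mean well above the median, as in the output above, usually points to a right-skewed distribution in which a few expensive listings pull the average up. The sketch below illustrates the check on a hypothetical set of eight prices; the values are made up for the example.

```python
import pandas as pd

# Hypothetical prices with one expensive outlier; in the notebook, use df['price']
prices = pd.Series([50, 80, 100, 101, 150, 150, 200, 1000])

mean_price = prices.mean()
median_price = prices.median()
mode_price = prices.mode().iloc[0]  # mode() returns a Series; take the first mode
skewness = prices.skew()            # positive values indicate a long right tail

print(f"mean={mean_price:.2f}, median={median_price:.2f}, mode={mode_price}")
print(f"skewness={skewness:.2f}")
```

When the skew is strongly positive, the median is generally the safer summary of a "typical" price.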
# Chart - 6 visualization code
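# Distribution of the last review dates
# Parse the review dates so they are binned as dates rather than as strings
last_review_dates = pd.to_datetime(df['last_review'])
# Creating the chart (displot is figure-level, so it creates its own figure)
sns.displot(x=last_review_dates, bins=10, color='green', height=6, aspect=1.3)
# Setting labels and parameters
plt.xlabel('last_review')
plt.ylabel('Count')
plt.title('Last Review Distribution')
# Display the Chart
plt.show()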
# Chart - 7 visualization code
# Bar plot for availability 365 and room type
# Setting chart size
plt.figure(figsize=(8,4))
# Creating charts
sns.barplot(x='availability_365', y= 'room_type', data = df)
# setting label and parameters
plt.xlabel('availability 365')
plt.ylabel('room type')
plt.title('Availability 365 for Room Type')
# Display the chart
plt.show()
Shared room availability: Shared rooms have a higher average availability (about 162 days) compared to the other room types.
Comparable availability: Entire homes/apartments and private rooms have similar average availability, around 110 days.
Room type impact: Room type influences availability, with shared rooms being more available.
Booking consideration: Guests seeking shared rooms have more options throughout the year, while booking in advance may be necessary for entire homes/apartments and private rooms.
These insights highlight the difference in availability based on room type, with shared rooms more readily available and advance booking more likely to be needed for entire homes/apartments and private rooms.
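The bar heights in a chart like this are group means, so the figures can be reproduced with a grouped aggregation. This is a minimal sketch on a hypothetical six-row sample; the median and count columns are added here as an assumption about what is useful context, since a mean alone can hide small or skewed groups.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df[['room_type', 'availability_365']]
sample = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Entire home/apt",
        "Private room", "Private room",
        "Shared room", "Shared room",
    ],
    "availability_365": [100, 120, 90, 130, 150, 170],
})

# Mean matches the bar height; median and count give context for each group
availability = (
    sample.groupby("room_type")["availability_365"]
    .agg(["mean", "median", "count"])
)
print(availability)

most_available = availability["mean"].idxmax()
print("Most available room type:", most_available)
```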
# Chart - 8 visualization code
# Group by neighbourhood group and calculate the mean of price
grouped = df.groupby('neighbourhood_group')['price'].mean().reset_index()
# Sort the data by average 'price' in descending order
grouped = grouped.sort_values(by = 'price', ascending = False)
# create a bar plot to visualize the average 'price' by 'neighbourhood_group'
plt.figure(figsize=(10,3))
sns.barplot(data= grouped, x = 'price', y = 'neighbourhood_group', orient= 'h')
# call plt.title; assigning to it would overwrite the function
plt.title('Average Price By Neighbourhood Group')
plt.xlabel('Average Price')
plt.ylabel('Neighbourhood Group')
# Display the Chart
plt.show()
Prices for Airbnb listings vary significantly across neighbourhood groups. Manhattan has the highest average price, followed by Staten Island, Brooklyn, Queens and the Bronx.
Manhattan and Brooklyn are the most popular choices for Airbnb, with a large number of listings in both areas. Queens, the Bronx and Staten Island have fewer listings.
The bar plot shows the price differences visually, helping users make informed decisions about their accommodation based on their budget and preferences.
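Since price depends on both the neighbourhood group and the room type, a two-way table can separate the two effects. The sketch below is a hedged illustration on a hypothetical eight-row sample; it uses the median rather than the mean as an assumption, to limit the influence of a few very expensive listings.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df with the same three columns
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 4,
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"] * 2,
    "price": [200, 240, 100, 120, 90, 110, 50, 60],
})

# Rows are neighbourhood groups, columns are room types, cells are median prices
price_table = pd.pivot_table(
    sample,
    values="price",
    index="neighbourhood_group",
    columns="room_type",
    aggfunc="median",
)
print(price_table)
```

Reading across a row compares room types within one area, while reading down a column compares areas for the same room type.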
# Chart - 9 visualization code
# pie chart on base of minimum nights and neighbourhood group
# group by the 'neighbourhood_group' and calculate the mean of 'minimum nights'
plt.figure(figsize=(10,8))
explode = (0.05,0.05,0.05,0.05,0.05) #explode the slice by radius
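# select the numeric column before averaging so non-numeric columns are not aggregated
df.groupby('neighbourhood_group')['minimum_nights'].mean().plot(kind='pie', figsize=(8,6),startangle=90, autopct='%.3f',shadow=True, explode = explode)
plt.ylabel('')# leave the label empty so it does not crowd the pie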
# display the chart
plt.show()
The average minimum nights required for Airbnb listings varies across neighbourhood groups in New York City. Manhattan has the highest average minimum nights, followed by Queens, Brooklyn, Staten Island and the Bronx.
Manhattan has the longest average minimum nights, indicating a preference for longer stays in this area. This could be due to the borough's attractions and the desire for a more immersive experience.
Staten Island has the second highest average minimum nights, suggesting that visitors to this neighbourhood group also tend to stay for a relatively longer duration compared to other areas.
Brooklyn and Queens have similar average minimum nights, indicating that visitors to these neighbourhood groups also opt for longer stays, though slightly shorter than those in Manhattan and Staten Island.
The Bronx has the lowest average minimum nights among all the neighbourhood groups. Visitors can consider factors like price, desired length of stay and attractions when selecting their accommodation in New York City.
# Chart - 10 visualization code
# Scatter plot on minimum nights by price
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='minimum_nights', y='price' )
plt.xlabel('Minimum Nights')
plt.ylabel('Price')
plt.show()
Shorter stays (fewer than 5-6 minimum nights) show widely varying prices.
Longer stays show a more consistent pricing structure, forming a nearly horizontal band just above zero.
Prices tend to decrease as the number of minimum nights increases, indicating a potential negative correlation between minimum stay and price.
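The pattern read from the scatter plot can be checked by grouping listings into minimum-stay bands and comparing the price spread in each band. This is a minimal sketch on a hypothetical ten-row sample; the band edges (6 and 29 nights) are assumptions chosen for the example, not thresholds taken from the data.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df[['minimum_nights', 'price']]
sample = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 5, 7, 14, 30, 30, 60, 90],
    "price": [80, 300, 120, 150, 95, 110, 90, 100, 85, 95],
})

# Assumed band edges; pd.cut uses right-inclusive intervals: (0, 6], (6, 29], (29, 365]
bins = [0, 6, 29, 365]
labels = ["1-6 nights", "7-29 nights", "30+ nights"]
sample["stay_band"] = pd.cut(sample["minimum_nights"], bins=bins, labels=labels)

# Median price and price range per band
summary = (
    sample.groupby("stay_band", observed=True)["price"]
    .agg(["median", "min", "max", "count"])
)
summary["spread"] = summary["max"] - summary["min"]
print(summary)
```

A shrinking spread and a falling median across the bands would support the reading above; on the real data the result may differ.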
# Chart - 11 visualization code
# top Hotels as per Review
plt.figure(figsize=(10,6))
# Sorting Reviews as per Descending Order
sorted_data=df.sort_values(by='number_of_reviews', ascending = False)[:30]
sns.barplot(x = sorted_data['number_of_reviews'], y = sorted_data['name'],palette='viridis')
plt.xlabel('Count')
plt.ylabel('New York City AirBnb Hotels')
plt.show()
Overall, the bar plot highlights the listings with the most reviews, showcasing a mix of accommodation across various neighbourhoods of New York City. This information can be helpful for travellers looking for highly reviewed options and popular destinations within the city.
The average price tends to be higher for hosts with a calculated_host_listings_count in the range of 10.0-15.0, approx around 25.0.
Hosts with a calculated_host_listings_count in the range of 0.0-5.0 have an average price of approximately 160. For hosts with a calculated_host_listings_count in the range of 5.0-10.0, the average price is around 100.
The calculated_host_listings_count range of 15.0-20.0 has a relatively lower average price, approx 120.
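Averages per listing-count range, like those described above, can be produced by binning `calculated_host_listings_count` into intervals of five and averaging the price in each bin. The sketch below is one possible way to do this, not necessarily how the figures above were obtained; the sample values are hypothetical and unrelated to the numbers quoted in the text.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df with the same two columns
sample = pd.DataFrame({
    "calculated_host_listings_count": [1, 2, 4, 6, 8, 11, 14, 16, 19],
    "price": [120, 140, 130, 80, 100, 200, 220, 90, 110],
})

# Bins of width 5; pd.cut uses right-inclusive intervals: (0, 5], (5, 10], ...
bins = [0, 5, 10, 15, 20]
labels = ["0-5", "5-10", "10-15", "15-20"]
sample["listings_band"] = pd.cut(
    sample["calculated_host_listings_count"], bins=bins, labels=labels
)

# Average price per band
avg_price_by_band = sample.groupby("listings_band", observed=True)["price"].mean()
print(avg_price_by_band)
```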
# Chart - 13 visualization code
# Longitude VS Latitude VS Price
plt.figure(figsize=(10,6))
# Creating the chart: colour each point by price (c=) instead of scaling marker size
plt.scatter(df['longitude'],df['latitude'],c=df['price'], cmap='viridis', alpha=0.6)
plt.colorbar(label='price')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
#display the chart
plt.show()
Dense Concentrations: There is a denser concentration of data points within the longitude range of -74.0 to -73.9. This indicates a higher density of Airbnb listings in this specific region.
Rising Trend: As we move from left to right in the chart within the specified region, there is a rising trend. This suggests that properties located towards the eastern part of the region tend to have longer minimum stay requirements.
Higher Minimum Nights: Within the concentrated region, the majority of data points exhibit higher minimum stay requirements.
Sparse Data and Lower Minimum Nights Outside the Region: Outside this region, there are fewer data points, indicating a lower density of Airbnb listings. Additionally, the minimum nights tend to be lower in these areas compared to the concentrated region.
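The density and minimum-stay claims for the longitude band can be quantified by flagging the listings that fall inside it. The sketch below is a minimal illustration on a hypothetical seven-row sample; the band limits come from the text above, while the "inside"/"outside" labels are names introduced for the example.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df[['longitude', 'minimum_nights']]
sample = pd.DataFrame({
    "longitude": [-73.99, -73.95, -73.97, -73.92, -73.85, -74.10, -73.80],
    "minimum_nights": [5, 7, 3, 10, 2, 1, 3],
})

# True for listings whose longitude lies in the dense band (inclusive bounds)
in_band = sample["longitude"].between(-74.0, -73.9)

# Share of all listings that fall inside the band, as a percentage
share_in_band = in_band.mean() * 100

# Average minimum stay inside and outside the band
band_label = in_band.map({True: "inside", False: "outside"})
mean_nights = sample.groupby(band_label)["minimum_nights"].mean()

print(f"share of listings in band: {share_in_band:.1f}%")
print(mean_nights)
```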
# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))
# Labeling
sns.heatmap(df[['price','minimum_nights', 'availability_365']].corr(), annot=True)
plt.show()
# Values as per chart
df[['price','minimum_nights','availability_365']].corr().head().style.background_gradient(cmap='Oranges')
Overall, the heatmap above reveals a mix of weak positive and negative correlations among the variables. It suggests that the number of reviews and the review rate per month have the strongest relationship, while the other variables have weaker or negligible correlations with each other.
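To compare the review columns with the rest, the correlation matrix needs to include them alongside price, minimum nights and availability. The sketch below shows one way to compute the wider matrix and pick out the strongest pair; the six-row sample is hypothetical, and in the notebook the same column selection would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the five numeric columns of interest
sample = pd.DataFrame({
    "price": [100, 250, 80, 150, 60, 300],
    "minimum_nights": [2, 1, 30, 3, 5, 2],
    "number_of_reviews": [10, 50, 0, 120, 5, 80],
    "reviews_per_month": [0.5, 2.1, 0.0, 4.8, 0.3, 3.0],
    "availability_365": [200, 365, 0, 90, 30, 300],
})

# Pearson correlation between every pair of columns
corr = sample.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair with the largest absolute correlation
strongest_pair = pairs.abs().idxmax()
print("Strongest pair:", strongest_pair, round(pairs[strongest_pair], 3))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would give the corresponding heatmap for all five columns.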
# Pair Plot Visualization code
sns.pairplot(df)
To achieve the business objective of Airbnb booking analysis, the client should adopt a data-driven approach that leverages comprehensive data insights to optimize the booking platform and enhance customer experience. Here's a step-by-step plan:
Data Collection: Gather a wide range of data on Airbnb bookings, including listing details, host information, pricing, guest reviews, and booking patterns. This data can be obtained from internal databases or through web scraping publicly available sources.
Data Preprocessing: Thoroughly clean and preprocess the collected data to remove duplicates, handle missing values, and correct errors. Ensuring data quality is crucial for accurate analysis.
Exploratory Data Analysis (EDA): Conduct an in-depth EDA to gain insights into the data. Identify trends, seasonal patterns, and correlations between different factors that impact bookings. Understanding these patterns can help target specific areas for improvement.
Market Segmentation: Segment the data based on key variables such as location, property type, and price range. Analyzing different market segments allows for tailored strategies to meet the unique demands of each segment.
Demand Forecasting: Implement time series forecasting models to predict future booking demand. This will aid hosts in optimizing pricing and availability, ultimately maximizing occupancy and revenue.
Sentiment Analysis: Perform sentiment analysis on guest reviews to assess customer satisfaction. Understanding guests' feedback will highlight strengths and weaknesses, leading to better service and increased positive reviews.
Price Optimization: Utilize pricing algorithms to optimize listing prices based on demand, competitor pricing, and seasonal variations. Optimized pricing ensures competitiveness and attracts more bookings.
Host Performance Analysis: Evaluate host performance using metrics like occupancy rates, ratings, and guest feedback. Recognize top-performing hosts and incentivize improvements in service for others.
Competitive Analysis: Analyze competitors' offerings and market share to identify opportunities for differentiation. Understanding the competitive landscape can lead to strategies that set Airbnb apart in the market.
Business Recommendations: Based on the insights gained from the analysis, provide actionable recommendations to enhance overall booking performance, customer satisfaction, and revenue generation. These recommendations may include personalized marketing strategies, service enhancements, and targeted promotions.
By implementing these steps, Airbnb can optimize its booking platform, attract more guests, retain satisfied hosts, and ultimately achieve its business objective of maximizing bookings and revenue while delivering exceptional customer experiences. Data-driven decisions will be the cornerstone of success in the dynamic and competitive vacation rental market.
The Airbnb Booking Analysis provides valuable data-driven insights to optimize the platform and enhance customer experience. Through comprehensive data collection, exploratory analysis, and demand forecasting, Airbnb can implement pricing algorithms, recognize top-performing hosts, and identify opportunities for differentiation. Sentiment analysis of guest reviews enables improvements in service, ultimately maximizing bookings and revenue. The competitive analysis helps Airbnb stay ahead in the market. By acting on the recommended strategies, Airbnb can attract more guests, retain satisfied hosts, and achieve its business objective of sustained growth and success in the vacation rental industry.